Fix for #158 by using normpath #169

Merged
merged 5 commits into from
Mar 28, 2021
adegomme
Collaborator

@adegomme adegomme commented Mar 16, 2021

Fixes #158

There is also a fix for a related issue published in aiida-bigdft, so the version was bumped.

@sphuber
Collaborator

sphuber commented Mar 17, 2021

Thanks @adegomme. Does this also fix the other problems? I tried checking out this branch and upgrading to aiida-bigdft==0.2.6, but it is still failing in the parsing of the BigDFTCalculation output:

+-> REPORT at 2021-03-17 14:37:52.721775+00:00
 | [454|BigDFTCalculation|on_except]: Traceback (most recent call last):
 |   File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/plumpy/process_states.py", line 225, in execute
 |     result = self.run_fn(*self.args, **self.kwargs)
 |   File "/home/max/codes/aiida-core/aiida/engine/processes/calcjobs/calcjob.py", line 313, in parse
 |     exit_code_retrieved = self.parse_retrieved_output(retrieved_temporary_folder)
 |   File "/home/max/codes/aiida-core/aiida/engine/processes/calcjobs/calcjob.py", line 392, in parse_retrieved_output
 |     exit_code = parser.parse(**parse_kwargs)
 |   File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/aiida_bigdft/parsers.py", line 63, in parse
 |     get_abs_path(output_filename))
 |   File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/aiida_bigdft/data/__init__.py", line 116, in __init__
 |     self.logfile = path
 |   File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/aiida_bigdft/data/__init__.py", line 132, in logfile
 |     self.bigdftlogfile = Logfiles.Logfile(path)
 |   File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/BigDFT/Logfiles.py", line 392, in __init__
 |     raise ValueError("No log information provided.")
 | ValueError: No log information provided.

Edit: note, the above is with the fast protocol. With the moderate protocol, the BigDFTCalculation actually works, but it fails higher up in the workchain:

(aiida) max@501ca0e9c990:~/codes/aiida-common-workflows$ verdi process status 474
BigDftCommonRelaxWorkChain<474> Finished [400] [1:inspect_workchain]
    └── BigDFTRelaxWorkChain<475> Finished [101] [1:results]
        └── BigDFTBaseWorkChain<478> Finished [0] [2:results]
            └── BigDFTCalculation<479> Finished [0]

(aiida) max@501ca0e9c990:~/codes/aiida-common-workflows$ verdi process show 475
Property     Value
-----------  -----------------------------------------------
type         BigDFTRelaxWorkChain
state        Finished [101] Subprocess failed for relaxation
pk           475
uuid         ea80f4f1-ae5b-4530-ad84-34054095eb42
label
description
ctime        2021-03-17 14:38:49.026113+00:00
mtime        2021-03-17 14:39:01.882124+00:00
computer     [1] localhost

Inputs                 PK    Type
---------------------  ----  ----------------
relax
    algo               471   Str
    perform            470   Bool
    steps              473   Int
    threshold_forces   472   Float
clean_workdir          465   Bool
code                   2     Code
extra_retrieved_files  469   List
kpoints                468   Dict
max_iterations         464   Int
parameters             462   BigDFTParameters
pseudos                461   List
run_opts               463   Dict
show_warnings          466   Bool
structure              460   StructureData
structurefile          467   Str

Caller      PK  Type
--------  ----  --------------------------
CALL       474  BigDftCommonRelaxWorkChain

Called      PK  Type
--------  ----  -------------------
CALL       478  BigDFTBaseWorkChain

Log messages
---------------------------------------------
There are 1 log messages for this calculation
Run 'verdi process report 475' to see them

Why does the BigDFTRelaxWorkChain fail with a 101 if the BigDFTBaseWorkChain is perfectly fine?

@adegomme
Collaborator Author

adegomme commented Mar 17, 2021

I was able to run the fast one without failure this weekend, on the docker version of qm. I did not have enough memory for the moderate one, but I was able to launch it anyway.

I think the first error is due to a crash during the BigDFT run.
Two options come to mind:

  • out of memory: the Si test case is rather heavy with the default parameters and k-points (we should reduce this in the future), even in fast. I had to run it on a node with >24 GB to be fine, and it was worse for moderate.
  • timeout: since we have a single calculation for the whole relaxation, the default 1 h is too short for one node and the job is killed by Slurm. I launched the job with 6 MPI processes and OMP_NUM_THREADS=2, and it took ~5 h on a very old machine (Sandy Bridge dual Xeon).
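
For reference, the resources of the run described in the second bullet could be written as AiiDA calculation options roughly as follows. This is only a sketch: the six-hour wall time is an assumed margin above the ~5 h reported, `total_cores` is a hypothetical helper, and how the options are passed down to the BigDFT workchains is not shown.

```python
# Sketch only: options roughly matching the run described above
# (6 MPI processes, OMP_NUM_THREADS=2, about 5 h on one old node).
options = {
    'resources': {
        'num_machines': 1,
        'num_mpiprocs_per_machine': 6,
    },
    # Assumed margin above the ~5 h run time, instead of the 1 h default.
    'max_wallclock_seconds': 6 * 3600,
    'environment_variables': {'OMP_NUM_THREADS': '2'},
}


def total_cores(opts):
    """Hypothetical helper: cores used = machines * MPI processes * OpenMP threads."""
    resources = opts['resources']
    threads = int(opts['environment_variables']['OMP_NUM_THREADS'])
    return resources['num_machines'] * resources['num_mpiprocs_per_machine'] * threads


print(total_cores(options))
```

With these values the job occupies 12 cores on a single node, so the memory pressure mentioned in the first bullet applies to that one node.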

So this error means that no output was correctly generated (the logfile couldn't be retrieved/parsed)?

It's a bit odd that the second one goes further and does not report a failure at the calculation level, if the first one failed this way...
But for this one, it looks like it did get back the logfile, but not the additional positions files generated at each step of relaxation and for the final one. The relax workchain will check that these files are retrieved (extra_retrieved_files) and can be parsed back. If not, it will return this error (it's a bit generic, so it prints in the report a more precise message before returning).
I would be interested in looking at the generated files in the computation folders (a data subfolder should be generated) and at the logfile, to see if an error was raised somewhere, as well as at the report of the RelaxWorkChain.
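
The check described above (are all the extra files that were requested actually present among the retrieved ones?) can be sketched in plain Python. This is not the plugin's code: `missing_outputs` and the file names in the example are hypothetical and only illustrate the idea.

```python
def missing_outputs(retrieved_names, expected_names):
    """Return the expected file names that are absent from the retrieved ones."""
    retrieved = set(retrieved_names)
    return [name for name in expected_names if name not in retrieved]


# Hypothetical file names, for illustration only.
retrieved = ['log.yaml']
expected = ['log.yaml', 'final_positions.xyz']

missing = missing_outputs(retrieved, expected)
if missing:
    # A workchain could report this list before returning its generic error code.
    print('Missing output files: ' + ', '.join(missing))
```

Reporting the list of missing names before returning the generic exit code makes it easier to tell a failed retrieval from a run that never wrote the files.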

@sphuber
Collaborator

sphuber commented Mar 19, 2021

I tried to rerun, but now it even blows up my machine. I am pretty sure it is due to the heavy memory usage, as you said. I have 16 GB, but my machine stalls almost instantly and I have to hard reboot it. Is it normal for BigDFT to use such a huge amount of memory for a small silicon system? Not sure how to test this now... @bosonie given this, together with the long walltimes required for CP2K, it seems we have to come up with some kind of system to warn users about resource requirements. If they try it in the QM and it fails like it does now, they are going to assume there is a problem with the code, whereas it is simply a lack of resources. I think we should definitely update the SI and documentation to give estimates of the required resources, because otherwise it is going to cause problems, I am sure.

@bosonie
Collaborator

bosonie commented Mar 19, 2021

@sphuber this has always been my concern and a clear warning is in the SI of the paper. However it is just a generic warning, not specific. We need to systematically gather data in order to be more precise.

@adegomme
Collaborator Author

For Si, since we currently have to enlarge the cell to run it (8 atoms instead of 2), the cost is quite specific to this case. Al should be lighter and take much less time to run as well. But yes, this is too heavy for such small cases; I will look into removing some costly options in fast mode...

They had all been using the settings for precise for the last few versions, leading to long computation times and high memory usage.
@adegomme
Collaborator Author

K-point values had been computed incorrectly for the last few versions, resulting in too many k-points being used for the fast and moderate protocols. The new values use much less memory and time.

@sphuber sphuber self-requested a review March 27, 2021 09:13
@sphuber
Collaborator

sphuber commented Mar 27, 2021

Thanks for the updates @adegomme. I tried again and now the fast Si relax works without problems and in a reasonable amount of time. However, the moderate protocol still fails for the same reason. The calcjob and base workchain finish successfully, but the relax workchain stops with a 101:

max@1ea98b5800bb:~/codes/aiida-common-workflows$ verdi process status 249
BigDftCommonRelaxWorkChain<249> Finished [400] [1:inspect_workchain]
    └── BigDFTRelaxWorkChain<250> Finished [101] [1:results]
        └── BigDFTBaseWorkChain<253> Finished [0] [2:results]
            └── BigDFTCalculation<254> Finished [0]

and the report

max@1ea98b5800bb:~/codes/aiida-common-workflows$ verdi process report 249
2021-03-27 08:31:38 [9  | REPORT]:     [253|BigDFTBaseWorkChain|run_process]: launching BigDFTCalculation<254> iteration #1
2021-03-27 09:32:08 [10 | REPORT]:     [253|BigDFTBaseWorkChain|finish]: BigDFT job<254> completed successfully
2021-03-27 09:32:08 [11 | REPORT]:     [253|BigDFTBaseWorkChain|results]: work chain completed after 1 iterations
2021-03-27 09:32:09 [12 | REPORT]:     [253|BigDFTBaseWorkChain|on_terminated]: remote folders will not be cleaned
2021-03-27 09:32:09 [13 | REPORT]:   [250|BigDFTRelaxWorkChain|results]: Relaxation failed - no output found
2021-03-27 09:32:09 [14 | REPORT]: [249|BigDftCommonRelaxWorkChain|inspect_workchain]: the `BigDFTRelaxWorkChain` failed with exit status 101

Looking at the outputs of the BigDftCalculation there indeed doesn't seem to be an output structure:

max@1ea98b5800bb:~/codes/aiida-common-workflows$ verdi process show 254
Property     Value
-----------  ------------------------------------
type         BigDFTCalculation
state        Finished [0]
pk           254
uuid         87fe76b9-5801-40d6-8639-35b59754ebec
label
description
ctime        2021-03-27 08:31:38.010283+00:00
mtime        2021-03-27 09:32:08.770355+00:00
computer     [1] localhost

Inputs                   PK  Type
---------------------  ----  ----------------
code                      2  Code
extra_retrieved_files   252  List
kpoints                 243  Dict
parameters              251  BigDFTParameters
pseudos                 236  List
structure               235  StructureData
structurefile           242  Str

Outputs           PK  Type
--------------  ----  -------------
bigdft_logfile   259  BigDFTLogfile
remote_folder    257  RemoteData
retrieved        258  FolderData

Caller          PK  Type
------------  ----  -------------------
iteration_01   253  BigDFTBaseWorkChain

but then again, that is also not the case for the run with fast protocol, which worked just fine. So I think the BigDftCalculation simply never attaches an output StructureData. This is another change that I can recommend for the sake of provenance.
I will attach the output logfile here so you can maybe check why the higher level relax workchain complains about there not being an output structure.

Actually, I figured it out. While preparing the output files to upload them here, I noticed that the job was killed by the scheduler. The problem is that the parser doesn't check for this and simply returns a 0 exit code. It would be great if the parser could detect this and return a non-zero exit code. The standard that plugins have been using so far is ERROR_OUT_OF_WALLTIME(400), which you are free to adopt as well. At least this way the base workchain can add a simple handler to restart the calculation.
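
As a rough illustration of the suggestion above, the detection could look something like the following plain-Python sketch. It is not the aiida-bigdft parser: the function name and the patterns are assumptions (the first pattern matches the "DUE TO TIME LIMIT" message that Slurm writes to stderr when it cancels a job; other schedulers word this differently), and a real parser would return the exit code defined on the process spec rather than a bare integer.

```python
import re

# Exit status suggested in the discussion (ERROR_OUT_OF_WALLTIME).
ERROR_OUT_OF_WALLTIME = 400

# Assumed patterns; extend for other schedulers as needed.
_WALLTIME_PATTERNS = (
    re.compile(r'DUE TO TIME LIMIT', re.IGNORECASE),
    re.compile(r'walltime.*exceeded', re.IGNORECASE),
)


def exit_status_from_scheduler_stderr(stderr_text):
    """Return 400 if the scheduler stderr shows a walltime kill, else 0."""
    for pattern in _WALLTIME_PATTERNS:
        if pattern.search(stderr_text):
            return ERROR_OUT_OF_WALLTIME
    return 0


example = 'slurmstepd: error: *** JOB 42 ON node1 CANCELLED AT 2021-03-27T09:32:00 DUE TO TIME LIMIT ***'
print(exit_status_from_scheduler_stderr(example))
```

With a non-zero status of this kind, a base workchain handler can tell a walltime kill apart from a clean finish and resubmit the calculation.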

@sphuber
Collaborator

sphuber commented Mar 27, 2021

@adegomme I have opened three issues on aiida-bigdft with some suggestions on how to improve the plugin. Feel free to contact me if you would like to discuss. I could definitely help out with implementing them if you think they are a good idea.

@sphuber
Collaborator

sphuber commented Mar 28, 2021

Final wrap up: I have tested this branch on quantum-mobile:20.11.2a and have successfully run the relaxation on Si for the fast and moderate protocols. I also ran NH3-planar for the fast protocol which ran without issue. There are a few improvements to be done on the plugin side, but this can wait. I will merge this now as it seems most functionality required for the paper is working. Thanks a lot @adegomme !

@sphuber sphuber merged commit f93f5af into aiidateam:master Mar 28, 2021
Development

Successfully merging this pull request may close these issues.

BigDFT: common relax workchain excepts because of missing pseudo potential file
3 participants